Abstract

Each day, anti-virus companies receive tens of
thousands of samples of potentially harmful executables.
Many of the malicious samples are variations of 
previously encountered malware, created by their 
authors to evade pattern-based detection. Dealing 
with these large amounts of data requires robust, 
automatic detection approaches. 
This paper studies malware classification based on 
call graph clustering. By representing malware sam- 
ples as call graphs, it is possible to abstract cer- 
tain variations away, and enable the detection of 
structural similarities between samples. The ability 
to cluster similar samples together will make more 
generic detection techniques possible, thereby tar- 
geting the commonalities of the samples within a 
cluster. 

To compare call graphs mutually, we compute pair- 
wise graph similarity scores via graph matchings 
which approximately minimize the graph edit dis- 
tance. Next, to facilitate the discovery of similar 
malware samples, we employ several clustering al- 
gorithms, including k-medoids and DBSCAN. Clus- 
tering experiments are conducted on a collection of 
real malware samples, and the results are evaluated 
against manual classifications provided by human 
malware analysts. 

Experiments show that it is indeed possible to ac- 
curately detect malware families via call graph clus- 
tering. We anticipate that in the future, call graphs 
can be used to analyse the emergence of new mal- 
ware families, and ultimately to automate imple- 
mentation of generic detection schemes. 

KEYWORDS: Call Graph, Clustering, DBSCAN,
Graph Edit Distance, Graph Matching, k-medoids
Clustering, Vertex Matching

1 Introduction



Tens of thousands of potentially harmful executables
are submitted for analysis to data security
companies on a daily basis. To deal with these 
vast amounts of samples in a timely manner, au- 
tonomous systems for detection, identification and 
categorization are required. However, in practice 
automated detection of malware is hindered by code 
obfuscation techniques such as packing or encryp- 
tion of the executable code. Furthermore, cyber 
criminals constantly develop new versions of their 
malicious software to evade pattern-based detection 
by anti-virus products [31].

For each sample a data security company receives, 
it has to be determined whether the sample is 
malicious or has been encountered before, possi- 
bly in a modified form. Analogous to the human 
immune system, the ability to recognize common- 
alities among malware which belong to the same 
malware family would allow anti-virus products to 
proactively detect both known samples, as well 
as future releases of the malware samples from 
the family. To facilitate the recognition of simi- 
lar samples or commonalities among multiple sam- 
ples which have been subject to change, a high-level 
structure, i.e. an abstraction, of the samples is re- 
quired. One such abstraction is the call graph. A 
call graph is a graphical representation of a binary 
executable in which functions are modeled as ver- 
tices, and calls between those functions as directed 
edges [30].

This paper deals with mutual comparisons of mal- 
ware via their call graph representations, and the 
classification of structurally similar samples into 
malware families through the use of clustering al- 
gorithms. So far, only a limited amount of research 
has been devoted to malware classification and iden- 
tification using graph representations. Flake 
and later Dullien and Rolles [8] describe approaches 
to finding subgraph isomorphisms within control 
flow graphs, by mapping functions from one flow 
graph to the other. Functions which could not be
reliably mapped have been subject to change. Via 
this approach, the authors of both papers can for 
instance reveal differences between versions of the 
same executable or detect code theft. Additionally, 
the authors of [8] suggest that security experts could
save valuable time by only analyzing the differences 
among variants of the same malware. 
Preliminary work on call graphs specifically in the 
context of malware analysis has been performed by 
Carrera and Erdelyi [5]. To speed up the process 
of malware analysis, Carrera and Erdelyi use call 
graphs to reveal similarities among multiple mal- 
ware samples. Furthermore, after deriving similar- 
ity metrics to compare call graphs mutually, they 
apply the metrics to create a small malware tax- 
onomy using a hierarchical clustering algorithm. 
Briones and Gomez [3] continued the work started 
by Carrera and Erdelyi. Their contributions mainly 
focus on the design of a distributed system to com- 
pare, analyse and store call graphs for automated 
malware classification. Finally, the first large-scale
experiments on malware comparisons using real 
malware samples were recently published in [TTII2TJ] . 
Additionally, the authors of [17] describe techniques
for efficient indexing of call graphs in hierarchi- 
cal databases to support fast malware lookups and 
comparisons. 

In this paper we explore the potential of call
graph based malware identification and classification.
First, call graphs are introduced in more detail,
together with graph similarity metrics to compare
malware via their call graph representations, in Sections
2 and 3. At the basis of call graph comparisons lie
graph matching algorithms. Exact graph match- 
ings are expensive to compute, and hence we resort 
to approximation algorithms (Sections 3, 4). Fi- 
nally, in Section 5, the graph similarity metrics are 
used for automated malware classification via clus- 
tering algorithms on a collection of real malware 
call graphs. A more extensive report on the work is 
available in [20] . 

2 Introduction to Call Graphs 

A call graph models a binary executable as a di- 
rected graph whose vertices, representing the func- 
tions the executable is composed of, are inter- 
connected through directed edges which symbolize 
function calls [30]. A vertex can represent one
of the following two types of functions:

1. Local functions, implemented by the program 
designer. 

2. External functions: system and library calls. 

Local functions, the most frequently occurring functions
in any program, are written by the programmer
of the binary executable. External functions,
such as system and library calls, are stored in a library
as part of an operating system. Contrary to
local functions, external functions never invoke local
functions.

Figure 1: Example of a small malware call graph.
Function names starting with 'sub' denote local
functions, whereas the remaining functions are external
functions.

Analogous to [17], call graphs are formally defined
as follows: 

Definition 1. (Call Graph): A call graph is a
directed graph G with vertex set V = V(G), representing
the functions, and edge set E = E(G), where
$E(G) \subseteq V(G) \times V(G)$, in correspondence with the
function calls.

Call graphs are generated from a binary exe- 
cutable through static analysis of the binary with 
disassembly tools [S]. First, obfuscation layers are 
removed, thereby unpacking and, if necessary, de- 
crypting the executable. Next, a disassembler like 
IDA Pro [15] is used to identify the functions and 
assign them symbolic names. Since the function 
names of user written functions are not preserved 
during the compilation of the software, random 
yet unique symbolic names are assigned to them. 
External functions, however, have common names 
across executables. In case an external function is 
imported dynamically, one can obtain its name from 
the Import Address Table (IAT) [3S1 [23]. When, 
on the other hand, a library function is statically 
linked, the library function code is merged by the 
compiler into the executable. If this is the case, 
software like IDA Pro's FLIRT [16] has to be used
to recognize the standard library functions and to 
assign them the correct canonical names. Once all 
functions, i.e. the vertices in the call graph, are iden- 
tified, edges between the vertices are added, corre- 
sponding to the function calls extracted from the 
disassembled executable. 
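
As a rough illustration of the representation described above, the following sketch stores a call graph as sets of function names and caller/callee pairs. The class name `CallGraph`, the method `add_call`, the example function names, and the rule that local names start with 'sub' (borrowed from the naming shown in Figure 1) are assumptions made for this example; they are not part of any disassembler's API.

```python
from dataclasses import dataclass, field


def is_local(name):
    # Assumption: disassembler-assigned names of local functions start with 'sub'.
    return name.startswith("sub")


@dataclass
class CallGraph:
    """Directed call graph: vertices are function names, edges are calls."""
    local: set = field(default_factory=set)     # functions written by the programmer
    external: set = field(default_factory=set)  # system and library functions
    edges: set = field(default_factory=set)     # (caller, callee) pairs

    def add_call(self, caller, callee):
        # External functions never invoke local functions.
        if not is_local(caller) and is_local(callee):
            raise ValueError("external function cannot call a local function")
        for name in (caller, callee):
            (self.local if is_local(name) else self.external).add(name)
        self.edges.add((caller, callee))

    @property
    def vertices(self):
        return self.local | self.external


# Hypothetical sample: two local functions and one imported API call.
graph = CallGraph()
graph.add_call("sub_401000", "sub_401020")
graph.add_call("sub_401020", "CreateFileA")
```

Keeping local and external vertices in separate sets makes it cheap to find the external functions two samples have in common, since only those names are stable across executables.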

3 Graph Matching 

3.1 Graph matching techniques 

Detecting malware through the use of call graphs re- 
quires means to compare call graphs mutually, and 
ultimately, means to distinguish call graphs repre- 
senting benign programs from call graphs derived 
from malware samples. Mutual graph comparison 
is accomplished with graph matching. 

Definition 2. (Graph matching): For two graphs,
G and H, of equal order, the graph matching problem
is concerned with finding a one-to-one mapping
(bijection) $\phi : V(G) \to V(H)$ that optimizes a cost
function which measures the quality of the mapping.

In general, graph matching involves discovering 
structural similarities between graphs [27] through
one of the following techniques: 

1. Finding graph isomorphisms 

2. Detecting maximum common subgraphs 
(MCS) 

3. Finding minimum graph edit distances (GED) 

An exact graph isomorphism for two graphs, G
and H, is a bijective function f(v) that maps the
vertices V(G) to V(H) such that for all $i, j \in V(G)$,
$(i, j) \in E(G)$ if and only if $(f(i), f(j)) \in E(H)$ [35].
Detecting the largest common subgraph for a pair 
of graphs is closely related to graph isomorphism as 
it attempts to find the largest induced subgraph of 
G which is isomorphic to a subgraph in H. Conse- 
quently, one could interpret an exact graph isomor- 
phism as a special case of MCS, where the common 
subgraph encompasses all the vertices and edges in 
both graphs. Finally, the last technique, GED, cal- 
culates the minimum number of edit operations re- 
quired to transform graph G into graph H. 

Definition 3. (Graph edit distance): The graph 
edit distance is the minimum number of elemen- 
tary operations required to transform a graph G into 
graph H. A cost is defined for each edit operation, 
where the total cost to transform G into H equals 
the edit distance. 

Note that the GED metric depends on the choice 
of edit operations and the cost involved with each 
operation. Similar to [37] [21] [17], we only consider
vertex insertion/deletion, edge insertion/deletion 
and vertex relabeling as possible edit operations. 
We can now show that the MCS problem can be 
transformed into the GED problem. Given is the 
shortest sequence of edit operations ep which trans- 
forms graph G into graph H, for a pair of unlabeled, 
directed graphs G and H. Apply all the necessary 
destructive operations, i.e. edge deletion and ver- 
tex deletion, on graph G as prescribed by ep. The 
maximum common subgraph of G and H equals 
the largest connected component of the resulting 
graph. Without further proof, this reasoning can 
be extended to labeled graphs. 

For the purpose of identifying, quantifying and 
expressing similarities between malware samples, 
both MCS and GED seem feasible techniques. Unfortunately,
MCS is proven to be an NP-Complete
problem [14], from which the NP-hardness of GED
optimization follows by the previous argument (the
latter result was first proven in [37] by a reduction
from the subgraph isomorphism problem). Since
exact solutions for both MCS and GED are compu- 
tationally expensive to calculate, a large amount of 
research has been devoted to fast and accurate ap- 
proximation algorithms for these problems, mainly 
in the field of image processing [T3] and for bio- 
chemical applications [35] [35J. The remainder of 
this Subsection serves as a brief literature review of 
different MCS and GED approximation approaches. 
A two-stage discrete optimization approach for 
MCS is designed in [12]. In the first stage, a greedy
search is performed to find an arbitrary common 
subgraph, after which the second stage executes a 
local search for a limited number of iterations to 
improve upon the graph discovered in stage one. 
Similarly to [12], the authors of [35] also rely on 
a two-stage optimization procedure; however, contrary
to [12], their algorithm tolerates errors in the
MCS matching. A genetic algorithm approach to 
MCS is given in [33]. Finally, a distributed tech- 
nique for MCS based on message passing is provided 

A survey of three different approaches to perform 
GED calculations is conducted by Neuhaus, Riesen, 
et. al. in [27] [23 HI]- They first give an exact 
GED algorithm using A* search, but this algo- 
rithm is only suitable for small graph instances [51] . 
Next, A*-Beamsearch, a variant of A* search which 
prunes the search tree more rigidly, is tested. As 
is to be expected, the latter algorithm provides 
fast but suboptimal results. The last algorithm 
they survey uses Munkres' bipartite graph matching 
algorithm as an underlying scheme. Benchmarks 
show that this approach, compared to the A*-search 
variations, handles large graphs well, without affecting
the accuracy too much. In [18], the GED
problem is formulated as a Binary Linear Program, 
but the authors conclude that their approach is not 
suitable for large graphs. Nevertheless, they de- 
rive algorithms to calculate respectively the lower 
and upper bounds of the GED in polynomial time, 
which can be deployed for large graph instances as 
estimators of the exact GED. Inspired by the work 
of Justice and Hero in [18], the authors of [37] developed
new polynomial algorithms which find tighter
upper and lower bounds for the GED problem. 

3.2 Graph similarity 

In general, malware consists of multiple compo- 
nents, some of which are new and others which are 
reused from other malware [S] . The virus writer will 
test his creations against several anti-virus prod- 
ucts, making modifications along the way until the 
anti-virus programs do not recognize the virus anymore.
Furthermore, at a later stage the virus writer
might release new, slightly altered, versions of the
same virus [4] [32].

In this Section, we will describe how to determine 
the similarity between two malware samples, based 
on the similarity u(G, H) of their underlying call 
graphs. As will become evident shortly, the graph 
edit distance plays an important role in the quan- 
tification of graph similarity. After all, the extent to 
which the malware writer modifies a virus or reuses 
components should be reflected by the edit distance. 

Definition 4. (Graph similarity): The similarity
$\sigma(G, H)$ between two graphs G and H indicates the
extent to which graph G resembles graph H and vice
versa. The similarity $\sigma(G, H)$ is a real value on
the interval [0,1], where 0 indicates that graphs G
and H are identical whereas a value 1 implies that
there are no similarities. In addition, the following
constraints hold: $\sigma(G, H) = \sigma(H, G)$ (symmetry),
$\sigma(G, G) = 0$, and $\sigma(G, K_0) = 1$ where $K_0$ is the
null graph, $G \neq K_0$.

Before we can attend to the problem of graph 
similarity, we first have to revisit the definition of 
a graph matching as given in the previous Sec- 
tion. To find a bijection which maps the vertex 
set V(G) to V(H), the graphs G and H have to 
be of the same order. However, the latter is rarely 
the case when comparing call graphs. To circumvent
this problem, the vertex sets V(G) and V(H)
are supplemented with dummy vertices $\epsilon$ such that
the resulting sets V'(G), V'(H) are both of size
$|V(G)| + |V(H)|$ [SUEZ]. A mapping of a vertex v in
graph G to a dummy vertex $\epsilon$ is then interpreted as
deleting vertex v from graph G, whereas the opposite
mapping implies a vertex insertion into graph
H. Now, for a given graph matching $\phi$, we can
define three cost functions: VertexCost, EdgeCost
and RelabelCost.

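VertexCost The number of deleted/inserted vertices:
$|\{v : v \in [V'(G) \cup V'(H)] \wedge [\phi(v) = \epsilon \vee \phi(\epsilon) = v]\}|$.

EdgeCost The number of unpreserved edges:
$|E(G)| + |E(H)| - 2 \times |\{(i, j) : (i, j) \in E(G) \wedge (\phi(i), \phi(j)) \in E(H)\}|$.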

RelabelCost The number of mismatched func- 
tions, i.e. the number of external functions in 
G and H which are mapped against different 
external functions or local functions. 

The sum of these cost functions results in the graph
edit distance $\lambda_\phi(G, H)$:

$$\lambda_\phi(G, H) = \text{VertexCost} + \text{EdgeCost} + \text{RelabelCost} \quad (1)$$

Note that, as mentioned before, finding the minimum
GED, i.e. $\min_\phi \lambda_\phi(G, H)$, is an NP-hard problem,
but can be approximated. The latter is elaborated
in the next Subsection.



Finally, the similarity $\sigma(G, H)$ of two graphs is
obtained from the graph edit distance $\lambda_\phi(G, H)$:

$$\sigma(G, H) = \frac{\lambda_\phi(G, H)}{|V(G)| + |V(H)| + |E(G)| + |E(H)|} \quad (2)$$
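
To make Equations 1 and 2 concrete, the sketch below evaluates the three cost functions for one given matching. It rests on several assumptions that are not taken from the paper: the matching `phi` is a dictionary from vertices of G to vertices of H, with `None` playing the role of the dummy vertex; vertices of H that no vertex of G maps to count as insertions; local functions are recognised by the 'sub' prefix; and a mismatched pair contributes 1 to RelabelCost. All function and variable names are hypothetical.

```python
def is_external(name):
    # Assumption: local functions carry disassembler names starting with 'sub'.
    return not name.startswith("sub")


def edit_distance(g_vertices, g_edges, h_vertices, h_edges, phi):
    """VertexCost + EdgeCost + RelabelCost for the matching phi (Equation 1)."""
    matched_h = {v for v in phi.values() if v is not None}
    deleted = sum(1 for u in g_vertices if phi.get(u) is None)
    inserted = sum(1 for v in h_vertices if v not in matched_h)
    vertex_cost = deleted + inserted

    # An edge (i, j) of G is preserved if its image (phi(i), phi(j)) is an edge of H.
    preserved = sum(
        1 for (i, j) in g_edges
        if phi.get(i) is not None and phi.get(j) is not None
        and (phi[i], phi[j]) in h_edges
    )
    edge_cost = len(g_edges) + len(h_edges) - 2 * preserved

    # Local-to-local pairs are free; any pair involving a different external
    # function counts as one mismatch (an interpretation, see lead-in).
    relabel_cost = sum(
        1 for u, v in phi.items()
        if v is not None and u != v and (is_external(u) or is_external(v))
    )
    return vertex_cost + edge_cost + relabel_cost


def similarity(g_vertices, g_edges, h_vertices, h_edges, phi):
    """Normalised edit distance (Equation 2); 0 means identical graphs."""
    size = len(g_vertices) + len(h_vertices) + len(g_edges) + len(h_edges)
    if size == 0:
        return 0.0
    return edit_distance(g_vertices, g_edges, h_vertices, h_edges, phi) / size


# Hypothetical pair of tiny call graphs.
G_V = {"sub_a", "sub_b", "CreateFileA"}
G_E = {("sub_a", "sub_b"), ("sub_b", "CreateFileA")}
H_V = {"sub_x", "CreateFileA", "WriteFile"}
H_E = {("sub_x", "CreateFileA"), ("sub_x", "WriteFile")}
phi = {"sub_a": None, "sub_b": "sub_x", "CreateFileA": "CreateFileA"}
print(similarity(G_V, G_E, H_V, H_E, phi))  # 4 / 10 = 0.4
```

In this example one vertex is deleted, one is inserted, and one of the two edges of each graph is unpreserved, giving an edit distance of 4 over a total graph size of 10.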

3.3 Graph edit distance approximation

Finding a graph matching $\phi$ which minimizes the
graph edit distance is proven to be an NP-hard
problem [37]. Indeed, empirical results show that
finding such a matching is only feasible for low-order
graphs, due to the time complexity [23]. In
[21] [20], the performance of several graph matching
algorithms for call graphs is investigated. Based on
the findings in [21], the fastest and most accurate
results are obtained with an adapted version of Simulated
Annealing, a local search algorithm which
searches for a vertex mapping that minimizes the
GED. This algorithm turns out to be both faster
and more accurate than, for example, the algorithms
based on Munkres' bipartite graph matching algorithm
as applied in the related works [37] [17]. Two
steps can be distinguished in the Simulated Anneal- 
ing algorithm for call graph matching. In the first 
step, the algorithm determines which external func- 
tions a pair of call graphs have in common. These 
functions are mapped one-to-one. Next, the re- 
maining functions are mapped based on the out- 
come of the Simulated Annealing algorithm, which 
attempts to map the remaining functions in such a 
way that the GED for the call graphs under con- 
sideration is minimized. For more details, refer to 
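
The second step can be pictured with a generic simulated annealing loop over vertex permutations. The sketch below is not the adapted algorithm from the cited work: the swap neighbourhood, the geometric cooling schedule, the parameter values, and the names `anneal_matching` and `edge_cost` are all assumptions chosen for brevity. It also assumes that both graphs have already been padded to the same order and that vertices are numbered 0..n-1.

```python
import math
import random


def edge_cost(perm, g_edges, h_edges):
    """EdgeCost when vertex i of G is mapped to vertex perm[i] of H."""
    preserved = sum((perm[i], perm[j]) in h_edges for i, j in g_edges)
    return len(g_edges) + len(h_edges) - 2 * preserved


def anneal_matching(n, cost, steps=2000, t0=1.0, alpha=0.995, seed=0):
    """Search for a low-cost permutation; returns (best_perm, best_cost)."""
    rng = random.Random(seed)
    perm = list(range(n))
    cur = cost(perm)
    best, best_cost = perm[:], cur
    if n < 2:
        return best, best_cost
    t = t0
    for _ in range(steps):
        i, j = rng.sample(range(n), 2)
        perm[i], perm[j] = perm[j], perm[i]      # propose: swap two assignments
        new = cost(perm)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if new <= cur or rng.random() < math.exp((cur - new) / t):
            cur = new
            if cur < best_cost:
                best, best_cost = perm[:], cur
        else:
            perm[i], perm[j] = perm[j], perm[i]  # reject: undo the swap
        t = max(t * alpha, 1e-9)                 # geometric cooling
    return best, best_cost


# Hypothetical example: a directed path and the same path with reversed numbering.
g_edges = {(0, 1), (1, 2), (2, 3)}
h_edges = {(3, 2), (2, 1), (1, 0)}
best, best_cost = anneal_matching(4, lambda p: edge_cost(p, g_edges, h_edges))
```

In a fuller version the cost callable would be the complete edit distance of Equation 1, and vertices fixed in the first step (common external functions) would be excluded from the swap moves.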

4 Clustering 

Writing a malware detection signature for each in- 
dividual malware sample encountered is a cumber- 
some and time consuming process. Hence, to com- 
bat malware effectively, it is desirable to identify 
groups of malware with strong structural similari- 
ties, allowing one to write generic signatures which 
capture the commonalities of all malware samples 
within a group. This Section investigates several 
approaches to detect malware families, i.e. groups of 
similar malware samples, via clustering algorithms. 

4.1 k-medoids clustering

One of the most commonly used clustering techniques
is k-means clustering. The formal description
of k-means clustering is summarized as follows

mm-- 

Definition 5. (k-means Clustering): Given a data
set $\chi$, where each sample $x \in \chi$ is represented by a
vector of parameters, k-means clustering attempts
to group all samples into k clusters. For each cluster
$C_i \in C$, a cluster center $\mu_{C_i}$ can be defined,
where $\mu_{C_i}$ is the mean vector, taken over all the
samples in the cluster. The objective function of
k-means clustering is to minimize the total squared
Euclidean distance $\|x - \mu_{C_i}\|^2$ between each sample
$x \in \chi$, and the cluster center $\mu_{C_i}$ of the cluster the
sample has been allocated to:

$$\min \sum_{i=1}^{k} \sum_{x \in C_i} \|x - \mu_{C_i}\|^2$$

The above definition assumes that for each cluster,
it is possible to calculate a mean vector, the
cluster center (also known as centroid), based on 
all the samples inside a cluster. However, with a 
cluster containing call graphs, it is not a trivial pro- 
cedure to define a mean vector. Consequently, in- 
stead of defining a mean vector, a call graph inside 
the cluster is selected as the cluster center. More 
specifically, the selected call graph has the most 
commonalities, i.e. the highest similarity, with all 
other samples in the same cluster. This allows us 
to reformulate the objective function:

$$\min \sum_{i=1}^{k} \sum_{x \in C_i} \sigma(x, \mu_{C_i})$$

where $\sigma(G, H)$ is the similarity score of graphs G
and H as discussed in Section 3. The latter algorithm
is more commonly known as a k-medoids clustering
algorithm [19], where the cluster centers $\mu_{C_i}$
are referred to as 'medoids'. Since finding an exact 
solution in accordance with the objective function 
has been proven to be NP-hard [5], an approximation
algorithm is used (Algorithm 1).

Algorithm 1: The k-medoids clustering algorithm

Input: Number of clusters k, set of call graphs $\chi$.

Output: A set of k clusters C

foreach $C_i \in C$ do
    Initialize $\mu_{C_i}$ with an unused sample from $\chi$;
repeat
    Classify the remaining $|\chi| - k$ call graphs.
    Each sample $x \in \chi$ is put in the cluster
    which has the most similar cluster medoid;
    foreach $C_i \in C$ do
        Recompute $\mu_{C_i}$;
until the objective function converges;
return $C = C_0, C_1, \ldots, C_{k-1}$
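
A minimal Python sketch of Algorithm 1 follows, assuming the pairwise scores $\sigma(G, H)$ have been precomputed into a symmetric matrix `dist` (lower values mean more similar). The function name `k_medoids`, the optional `medoids` argument, the iteration cap, and the stopping rule (medoids unchanged between iterations) are choices made for this example.

```python
import random


def k_medoids(dist, k, medoids=None, max_iter=100, seed=0):
    """Cluster n samples given an n x n matrix of pairwise scores.

    Returns (medoids, clusters); both hold sample indices.
    """
    n = len(dist)
    rng = random.Random(seed)
    # Initialisation: given medoids, or k distinct samples chosen at random.
    medoids = list(medoids) if medoids is not None else rng.sample(range(n), k)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: every sample joins the cluster of its closest medoid.
        clusters = [[] for _ in medoids]
        for x in range(n):
            closest = min(range(len(medoids)), key=lambda c: dist[x][medoids[c]])
            clusters[closest].append(x)
        # Update step: the new medoid is the member with the lowest total
        # score to all other members of its cluster.
        new_medoids = []
        for c, members in enumerate(clusters):
            if members:
                new_medoids.append(
                    min(members, key=lambda m: sum(dist[m][x] for x in members))
                )
            else:
                new_medoids.append(medoids[c])  # keep the medoid of an empty cluster
        if new_medoids == medoids:              # converged: nothing changed
            break
        medoids = new_medoids
    return medoids, clusters


# Hypothetical example: six samples forming two obvious groups on a line.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
medoids, clusters = k_medoids(dist, 2, medoids=[0, 3])
# medoids == [1, 4]; clusters == [[0, 1, 2], [3, 4, 5]]
```

Working on a precomputed matrix keeps the expensive graph matching out of the clustering loop: each pairwise score is computed once and then only looked up.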

In [22], a formal proof on the convergence of k-means
clustering with respect to its objective function
is given. To summarize, the authors of [35]
prove that the objective function decreases monotonically
during each iteration of the k-means algorithm.
Because there are only a finite number
of possible clusterings, the k-means clustering algorithm
will always obtain a result which corresponds
to a (local) minimum of the objective function.
Since k-medoids clustering is directly derived
from k-means clustering, the proof also applies to
k-medoids clustering.

To initialize the cluster medoids, we use three different
algorithms. The first approach selects the
medoids at random from $\chi$. Arthur and Vassilvitskii
observed in their work [1] that k-means clustering,
and consequently also k-medoids clustering, is a
fast, but not necessarily accurate approach. In fact,
the clusterings obtained through k-means clustering
can be arbitrarily bad [1]. In their results, the
authors of [1] conclude that bad results are often
obtained due to a poor choice of the initial cluster
centroids, and hence they propose a novel way
to select the initial centroids, which considerably
improves the speed and accuracy of the k-means
clustering algorithm [1]. In summary, the authors
describe an iterative approach to select the medoids,
one after another, where the choice of a new medoid
depends on the earlier selected medoids. For a detailed
description of their k-means++ algorithm,
refer to [1]. Finally, the last algorithm to select the
initial medoids will be used as a means to assess
the quality of the clustering results. To assist the
k-medoids clustering algorithm, the initial medoids
are selected manually by anti-virus analysts. We
will refer to this initialization technique as "Trained
initialization".
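
The iterative seeding idea can be sketched as follows. This is an adaptation of k-means++ style seeding to medoids under stated assumptions: pairwise scores are available in a matrix `dist`, each new medoid is drawn with probability proportional to the squared score to its nearest already-chosen medoid, and k does not exceed the number of samples. The function name `seed_medoids` is hypothetical; for the original procedure, the cited work remains the reference.

```python
import random


def seed_medoids(dist, k, seed=0):
    """Pick k distinct initial medoids (sample indices), one after another."""
    rng = random.Random(seed)
    n = len(dist)
    medoids = [rng.randrange(n)]           # first medoid: uniformly at random
    while len(medoids) < k:
        # Squared score from each sample to its nearest chosen medoid.
        # Already-chosen medoids get weight 0 and cannot be drawn again.
        weights = [min(dist[x][m] for m in medoids) ** 2 for x in range(n)]
        if sum(weights) == 0:
            # Degenerate case: every sample coincides with a chosen medoid.
            remaining = [x for x in range(n) if x not in medoids]
            medoids.append(rng.choice(remaining))
        else:
            medoids.append(rng.choices(range(n), weights=weights, k=1)[0])
    return medoids


# Hypothetical example on six samples placed on a line.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
initial = seed_medoids(dist, 2)
```

Because samples far from every chosen medoid receive large weights, the seeds tend to land in different groups, which is the property that makes this initialization less sensitive to unlucky starts than uniform random selection.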

4.1.1 Clustering performance analysis 

In this Subsection, we will test and investigate the
performance of k-medoids clustering, in combination
with the graph similarity scores obtained via
the GED algorithm discussed in Section 3. The
data set $\chi$ we use consists of 194 malware call graph
samples which have been manually classified by analysts
from F-Secure Corporation into 24 families.
Evaluation of the cluster algorithms is performed
by comparing the obtained clusters against these
24 partitions. To give a general impression of the
samples: the call graphs in our test set contain on
average 234 nodes and 488 edges. The largest sample
has 748 vertices and 1875 edges. Family sizes
vary from 2 to 17 unique call graph samples.

Before k-medoids clustering can be applied on the
data collection, we need to select a suitable value
for k. Let $k_{optimal}$ be the natural number of clusters
present in the data set. Finding $k_{optimal}$ is not
a trivial task, and is analysed in depth in the next
Subsection. For now, we assume that $k_{optimal} = 24$,
the number of clusters obtained after manual classification.
Note however that $k_{optimal}$ depends on
the cluster criteria used to obtain a clustering. In
Figure 2, the average distance $d(x_i, \mu_{C_i})$ between
a sample $x_i$ in cluster $C_i$ and the medoid of that
cluster $\mu_{C_i}$ is plotted against the number of clusters
in use. Each time k-medoids clustering is repeated,
the algorithm could yield a different clustering
due to the randomness in the algorithm. Hence,
for a given number of clusters k, we run k-medoids
clustering 50 times, and average $d(x_i, \mu_{C_i})$. When
comparing the different initialization methods of
k-medoids clustering, based on Figure 2 one can indeed
conclude that k-means++ yields better results
than the randomly initialized k-medoids algorithm
because k-means++ discovers tighter, more
coherent clusters. Furthermore, the best results are
obtained with Trained clustering, where a member
from each of the 24 predetermined malware families
is chosen as the initial cluster medoid.
Figures 3a and 3b depict heat maps of two possible
clusterings of the sample data. Each square in the heat
map denotes the presence of samples from a given
malware family in a cluster. As an example, cluster
0 in Figure 3a comprises 86% Ceeinject samples, 7%
Runonce samples and 7% Neeris samples. The family
names are invented by data security companies
and serve only as a means to distinguish families.
Figure 3a shows the results of k-medoids clustering
with Trained initialization. The initial medoids
are selected by manually choosing a single sample
from each of the 24 families identified by F-Secure.
The clustering results are very promising: nearly
all members from each family end up in the same
cluster (Figure 3a). Only a few families, such as
Baidu and Boaxxe, are scattered over multiple clusters.
Figure 3b shows the clustering results of
k-means++ (footnote 1). Clearly, the clusterings are not as
accurate as with our Trained k-medoids algorithm;
samples from different families are merged into the
same cluster. Nevertheless, in most clusters samples
originating from a single family are prominently
present. Yet, before one can conclude whether
k-means++ clustering is a suitable algorithm to perform
call graph clustering, one first needs an automated
procedure to discover, or at the minimum
estimate with reasonable accuracy, $k_{optimal}$. The
latter issue is investigated in the next Subsection.

4.2 Determining the number of clusters

The k-medoids algorithm requires the number of
clusters the algorithm should deliver as input. Two
quality metrics are used to analyse the natural number
of clusters, $k_{optimal}$, in the sample set: Sum of

1 A similar figure for randomly initialized k-medoids clustering
is omitted due to its reduced accuracy with respect to
k-means++.

Figure 2: Quality of clusters. The average distance
$d(x_i, \mu_{C_i})$ between a sample $x_i$ in cluster $C_i$ and the
cluster's medoid $\mu_{C_i}$ is averaged over 50 executions
of the k-means algorithm.

Error and the silhouette coefficient. For a more 
elaborate discussion, and additional metrics, refer 

to ng. 

4.2.1 Sum of (Squared) Error 

The Sum of Error ($SE_p$) measures the total
amount of scatter in a cluster. The general formula
of $SE_p$ is:

$$SE_p = \sum_{i=1}^{k} \sum_{x \in C_i} \left( d(x, \mu_{C_i}) \right)^p \quad (3)$$

In this equation, d(x, y) is a distance metric which
measures the distance between a sample and its
corresponding cluster centroid (medoid) as a positive
real value. Here we choose $d(x_i, \mu_{C_i}) = 100 \times$

Ideally, when one plots $SE_p$ against an increasing
number of clusters, one should observe a quickly decreasing
$SE_p$ on the interval $[k = 1, k_{optimal}]$ and a
slowly decreasing value on the interval $[k_{optimal}, k = |\chi|]$.
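
Equation 3 translates almost directly into code. The sketch below assumes a precomputed distance matrix `dist`, a list of medoid indices, and clusters given as lists of sample indices in the same order as the medoids; the function name `sum_of_error` and this data layout are choices made for the example.

```python
def sum_of_error(dist, medoids, clusters, p=2):
    """SE_p: sum over all samples of (distance to own medoid) ** p."""
    return sum(
        dist[x][medoid] ** p
        for medoid, members in zip(medoids, clusters)
        for x in members
    )


# Hypothetical example: two clusters on a line, with the first sample of
# each cluster acting as its medoid.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
clusters = [[0, 1, 2], [3, 4, 5]]
print(sum_of_error(dist, [0, 3], clusters, p=1))  # 0+1+2 + 0+1+2 = 6
print(sum_of_error(dist, [0, 3], clusters, p=2))  # 0+1+4 + 0+1+4 = 10
```

Evaluating this value for a range of k and looking for the point where the curve flattens is the procedure the paragraph above describes.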

4.2.2 Silhouette Coefficient 

The average distance between a sample and its cluster
medoid measures the cluster cohesion [33]. The
cluster cohesion expresses how similar the objects
inside a cluster are. The cluster separation on the
other hand reflects how distinct the clusters are mutually.
An ideal clustering results in well-separated
(non-overlapping) clusters with a strong internal cohesion.
Therefore, $k_{optimal}$ equals the number of
clusters which maximizes both cohesion and separation.
The notions of cohesion and separation can
be combined into a single function which expresses
the quality of a clustering: the silhouette coefficient

E3i2g|. 



(a) Trained k-medoids clustering (axis label: Malware family).

Figure 3: Heat maps depicting non-unique clusterings
of 194 samples in 24 clusters. The color of a
square depicts the extent to which a certain family
is present in a cluster.

For each sample $x_i \in \chi$, let $a(x_i)$ be the average
similarity of sample $x_i \in C_k$ in cluster $C_k$ to all
other samples in cluster $C_k$:

$$a(x_i) = \frac{\sum_{x_j \in C_k,\, x_j \neq x_i} \sigma(x_i, x_j)}{|C_k| - 1} \qquad (x_i \in C_k)$$

Furthermore, let $b^k(x_i)$, $x_i \notin C_k$, be the average similarity
from sample $x_i$ to a cluster $C_k$ which does not
accommodate sample $x_i$:

$$b^k(x_i) = \frac{\sum_{x_j \in C_k} \sigma(x_i, x_j)}{|C_k|} \qquad (x_i \notin C_k)$$

Finally, $b(x_i)$ equals the minimum such $b^k(x_i)$:

$$b(x_i) = \min_k b^k(x_i), \qquad k \in \{0, 1, \ldots, |C|\}$$

The cluster for which $b^k(x_i)$ is minimal is the second
best alternative cluster to accommodate sample
$x_i$. From the discussion of cohesion and separation,
it is evident that for each sample $x_i$, it is desirable
to have $a(x_i) \ll b(x_i)$ so as to obtain a clustering with
tight, well-separated clusters.

The silhouette coefficient of a sample $x_i$ is defined
as:

$$s(x_i) = \frac{b(x_i) - a(x_i)}{\max(a(x_i), b(x_i))} \quad (4)$$

It is important to note that $s(x_i)$ is only defined
when there are 2 or more clusters. Furthermore,
$s(x_i) = 0$ if sample $x_i$ is the only sample inside its
cluster [25].

The silhouette coefficient $s(x_i)$ in Equation 4
always yields a real value on the interval [-1, 1].
To measure the quality of a cluster, the average
silhouette coefficient over the samples of the respective
cluster is computed. An indication of the
overall clustering quality is obtained by averaging
the silhouette coefficient over all the samples in $\chi$.
To find $k_{optimal}$, one has to search for a clustering
that yields the highest silhouette coefficient.
For a single sample $x_i$, $s(x_i)$ reflects how well the
sample is classified. Typically, when $s(x_i)$ is close
to 1, the sample has been classified well. On the
other hand, when $s(x_i)$ is a negative value, then
sample $x_i$ has been classified into the wrong cluster.
Finally, when $s(x_i)$ is close to 0, i.e. $a(x_i) \approx b(x_i)$,
it is unclear to which cluster sample $x_i$ should
belong: there are at least two clusters which could
accommodate sample $x_i$ well.
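
The per-sample silhouette coefficient can be computed from a matrix of pairwise scores as sketched below. The function name `silhouette`, the matrix `dist`, and the representation of clusters as lists of sample indices are assumptions for this example; the code also assumes at least two non-empty clusters and returns 0 for a sample that is alone in its cluster.

```python
def silhouette(dist, clusters):
    """Return {sample index: s(x)} for clusters given as lists of indices."""
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least 2 clusters")
    scores = {}
    for own in clusters:
        for x in own:
            if len(own) == 1:
                scores[x] = 0.0  # convention for a singleton cluster
                continue
            # a(x): average score to the other members of x's own cluster.
            a = sum(dist[x][y] for y in own if y != x) / (len(own) - 1)
            # b(x): lowest average score to any other non-empty cluster.
            b = min(
                sum(dist[x][y] for y in other) / len(other)
                for other in clusters
                if other is not own and other
            )
            top = max(a, b)
            scores[x] = (b - a) / top if top > 0 else 0.0
    return scores


# Hypothetical example: two well-separated groups on a line.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
scores = silhouette(dist, [[0, 1, 2], [3, 4, 5]])
average = sum(scores.values()) / len(scores)
# scores[1] is (10 - 1) / 10 = 0.9: sample 1 sits firmly inside its cluster.
```

Averaging the returned values over one cluster gives that cluster's quality, and averaging over all samples gives the overall score that is compared across different numbers of clusters.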



4.2.3 Experimental results 

The $SE_p$ and silhouette coefficients obtained after
clustering the 194 malware samples for various
numbers of clusters are depicted in Figure 4.
Since the results of the clustering are subject
to the randomness in k-medoids clustering, each
clustering is repeated 10000 times, and the best
obtained results are used in Figure 4. Interestingly,
the $SE_p$ curves for different values of p in Figure
4a do not show an evident value for $k_{optimal}$.
Similarly, no clear peak in the silhouette plot
(Figure 4b) is visible either, making it impossible
to derive $k_{optimal}$. Consequently, using a k-means
based algorithm, it is not possible to partition all
samples into cohesive, well-separated clusters based
on the graph similarity scores, such that the result
corresponds with the manual partitioning of the
samples by F-Secure. Furthermore, experimental
results show that for some samples it is unclear to
which cluster they should be assigned, hence
making it difficult to automatically reproduce the
24 clusters as proposed by F-Secure.



4.3 DBSCAN clustering 

In the previous section, we concluded that the
entire sample collection cannot be partitioned into
well-defined clusters, such that each cluster is both
tight and well-separated, using a k-means based
clustering algorithm. Central to the k-medoids
clustering algorithm is the selection of medoids.
A family inside the data collection is only correctly
identified by k-medoids if there exists a medoid with
a high similarity to all other samples in that family.
This, however, is not necessarily the case with
malware. Instead of assuming that all malware
samples in a family are mutually similar to a single
parent sample, it is more realistic to assume that
malware evolves. In such an evolution, malware
samples from one generation are based on the
samples from the previous generation. Consequently,
samples in generation $n$ likely have a high
similarity to samples in generation $n + 1$, but samples in
generation 0 are possibly quite different from those
in generation $n$, $n \gg 0$. This evolution theory
suggests that there are no clusters where the samples
are positioned around a single center in a spherical
fashion, which makes it much harder for a k-means
based clustering algorithm to discover clusters.
Although the k-medoids algorithm failed to
partition all 194 samples into well-defined clusters,
both Figure 3a and Figure 3b nevertheless reveal a
strong correspondence between the clusters found
by the k-medoids algorithm and the clusters as
predefined by F-Secure. This observation motivates us
to investigate partial clustering of the data which
discards samples for which it is not clear to which
cluster or family they belong. For this purpose,
we apply the Density-Based Spatial Clustering of
Applications with Noise (DBSCAN) clustering
algorithm [33, 10]. DBSCAN clustering searches for
dense areas in the data space, which are separated
by areas of low density. Samples in the low-density
areas are considered noise and are therefore
discarded, thereby ensuring that the clusters are
well-separated. An advantage of DBSCAN clustering is
that the high-density areas can have an arbitrary
shape; the samples do not necessarily need to be
grouped around a single center.

Figure 4: Finding $k_{optimal}$ in the set with 194
pre-classified malware samples.

To separate areas of low density from high-density
areas, DBSCAN utilizes two parameters: MinPts
and Rad. Using these parameters, DBSCAN is
able to distinguish between three types of sample
points:

• Core points: $P_c = \{x \in \chi : |N_{Rad}(x)| > MinPts\}$, where
  $N_{Rad}(x) = \{y \in \chi : \sigma(x, y) < Rad\}$

• Border points: $P_b = \{x \in (\chi \setminus P_c) : \exists y \in P_c, \sigma(x, y) < Rad\}$

• Noise points: $P_n = \chi \setminus (P_c \cup P_b)$

An informal description of the DBSCAN clustering
algorithm is given in Algorithm 2.

Algorithm 2: DBSCAN clustering algorithm
Input: Set of call graphs $\chi$, MinPts, Rad
Output: Partial clustering of $\chi$

1 Classify $\chi$ into Core points, Border points and
  Noise;
2 Discard all samples classified as noise;
3 Connect all pairs $(x, y)$ of core points with
  $\sigma(x, y) < Rad$;
4 Each connected structure of core points forms
  a cluster;
5 For each border point, identify the cluster
  containing the nearest core point, and add the
  border point to this cluster;
6 return Clustering
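
The steps of Algorithm 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the experiments: the pairwise scores are assumed to be available as a symmetric matrix `dist` in which smaller values mean more similar call graphs (0 for identical graphs), the strict inequalities follow the point definitions given earlier, and the function name `dbscan` is a hypothetical choice.

```python
def dbscan(dist, min_pts, rad):
    """Partial clustering following the steps of Algorithm 2.

    dist -- symmetric matrix of pairwise dissimilarities (0 = identical)
    Returns one entry per sample: a cluster id, or None for noise.
    """
    n = len(dist)
    # Step 1: the Rad-neighbourhood of x contains x itself, since
    # dist[x][x] = 0 < rad for any positive radius.
    neighbours = [[j for j in range(n) if dist[i][j] < rad] for i in range(n)]
    core = [i for i in range(n) if len(neighbours[i]) > min_pts]
    core_set = set(core)

    # Step 2: samples that keep the value None are noise and are discarded.
    labels = [None] * n

    # Steps 3-4: each connected structure of core points forms a cluster.
    cluster_id = 0
    for start in core:
        if labels[start] is not None:
            continue
        labels[start] = cluster_id
        stack = [start]
        while stack:
            i = stack.pop()
            for j in neighbours[i]:
                if j in core_set and labels[j] is None:
                    labels[j] = cluster_id
                    stack.append(j)
        cluster_id += 1

    # Step 5: a border point joins the cluster of its nearest core point.
    for i in range(n):
        if i in core_set:
            continue
        near_cores = [j for j in neighbours[i] if j in core_set]
        if near_cores:
            nearest = min(near_cores, key=lambda j: dist[i][j])
            labels[i] = labels[nearest]
    return labels
```

Because the sketch builds all neighbourhoods up front, it needs quadratic time in the number of samples, which is unproblematic for a collection of a few hundred call graphs once the pairwise scores are known.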



The question now arises how to select the
parameters MinPts and Rad. Based on experimental
results, the authors of [10] find MinPts = 4 to be
a good value in general. To determine a suitable
value for Rad, the authors suggest creating a graph
in which the samples are plotted against the distance
(similarity) to their k-th nearest neighbor, in
ascending order. Here k equals MinPts. The reasoning
behind this procedure is as follows: Core or Border
points are expected to have a nearly constant
similarity to their k-th nearest neighbor, assuming that k
is smaller than the size of the cluster the point
resides in, and that the clusters are roughly of equal
density. Noise points, on the contrary, are expected
to have a relatively larger distance to their k-th nearest
neighbor. This change in distance should be
reflected in the graph, since the distances are sorted
in ascending order.
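
The sorted curve described above can be computed as follows. This is a minimal sketch that again assumes a symmetric matrix `dist` of pairwise scores in which smaller values mean more similar samples; the name `k_distances` is illustrative.

```python
def k_distances(dist, k):
    """Distance of every sample to its k-th nearest neighbour, ascending.

    dist -- symmetric matrix of pairwise dissimilarities
    k    -- neighbour rank, at most len(dist) - 1; typically equal to MinPts
    """
    result = []
    for i, row in enumerate(dist):
        # Exclude the sample itself before ranking its neighbours.
        others = sorted(d for j, d in enumerate(row) if j != i)
        result.append(others[k - 1])
    return sorted(result)
```

Plotting the returned list against its index gives the graph in question; a sharp rise towards the right-hand end marks the transition from core and border points to noise and hence a candidate value for Rad.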
Figure 5a shows the similarity of our malware
samples to their k-th nearest neighbors, for various k.
Arguably, one can observe rapid increases in slope
both at Rad = 2.2 and Rad = 4.8 for all k. A radius
of Rad = 4.8 can be considered too large to
apply in the DBSCAN algorithm, since such a wide
radius would merge several natural clusters into a
single cluster. Even though Rad = 2.2 seems a
plausible radius, it is not evident from Figure 5a
which value of MinPts should be selected. To
circumvent this issue, DBSCAN clustering has been
performed for a large number of MinPts and Rad
combinations (Figure 5b). For each resulting
partitioning, the quality of the clusters has been
established with the silhouette coefficient. From
Figure 5b one can observe that the best clustering is
obtained for MinPts = 3 and Rad = 0.3. Comparing
Figure 5b against Figure 5a, it is not evident
why Rad = 0.3 is a good choice. We believe, however,
that the silhouette coefficient is the more
descriptive metric.

Finally, Figure 6 gives the results of the DBSCAN
algorithm for MinPts = 3 and Rad = 0.3 in a
frequency diagram. Each colored square gives the
frequency of samples from a given family present in
a cluster. The top two rows of the diagram
represent, respectively, the total size of the family and
the number of samples from a family which were
categorized as noise. For example, the Boaxxe
family contains 17 samples in total, which were divided
over clusters 1 (14 samples), 6 (1 sample), and 17
(2 samples). No samples of the Boaxxe family were
classified as noise. The observation that the Boaxxe
family is partitioned into multiple clusters was
analysed in more detail; closer analysis of this family
revealed that there are several samples within the
family with a call graph structure which differs
significantly from the other samples in the family.
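
The counts behind such a frequency diagram are straightforward to tabulate. The sketch below assumes two parallel lists, the analyst-assigned family of each sample and the cluster id assigned to it (with `None` marking noise); the function name and the `"noise"` key are illustrative.

```python
from collections import Counter, defaultdict


def family_cluster_table(families, labels):
    """Count, per family, how many samples fall in each cluster.

    families -- families[i] is the analyst-assigned family of sample i
    labels   -- labels[i] is the cluster id of sample i, or None for noise
    """
    table = defaultdict(Counter)
    for family, label in zip(families, labels):
        table[family]["noise" if label is None else label] += 1
    return table
```

For the Boaxxe example above, the entry for that family would hold 14 samples under cluster 1, 1 under cluster 6, 2 under cluster 17, and none under noise.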
The results from the DBSCAN algorithm on the
malware samples are very promising. Except for 3
clusters, each cluster identifies a family correctly
without mixing samples from multiple families.
Furthermore, the majority of samples originating from
larger families were classified inside a cluster and
hence were not considered noise. Families which
contain fewer than MinPts samples are mostly
classified as noise (e.g. Vundo, Blebloh, Startpage, etc.),
unless they are highly similar to samples from
different families (e.g. Autorun). Of the larger
families, only Veslorn (8 samples) and Redosdru
(9 samples) were fully discarded as noise. Closer
inspection of these two families indeed showed that
the samples within the families are highly dissimilar
from a call graph perspective.

Finally, Figure 7 depicts a plot of the diameter and
the cluster tightness for each cluster in Figure 6.
The diameter of a cluster is defined as the similarity
of the most dissimilar pair of samples in the cluster,
whereas the cluster tightness is the average
similarity over all pairs of samples. Most of the clusters
are found to be very coherent. Only for clusters
2, 6, and 7 does the diameter differ significantly from
the average pairwise similarity. For clusters 2 and 6,
this is caused by the presence of samples from 2
different families which are still within Rad distance
from each other. Cluster 7 is the only exception
where samples are fairly different and seem to be
modified over multiple generations. Lastly, a
special case is cluster 16, where the cluster diameter
is 0. The call graphs in this cluster are isomorphic;
one cannot distinguish between these samples based
on their call graphs, even though they come from
different families. Closer inspection of the samples
in cluster 16 by F-Secure Corporation revealed that
the respective samples are so-called 'droppers'. A
dropper is an installer which contains a hidden
malicious payload. Upon execution, the dropper installs
the payload on the victim's system. The samples in
cluster 16 appear to be copies of the same dropper,
but each with a different malicious payload. Based
on these findings, the call graph extraction has been
adapted such that this type of dropper is recognized
in the future. Instead of creating the call graph from
the possibly harmless installer code, the payload is
extracted from the dropper first, after which a call
graph is created from the extracted payload.
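
The two cluster statistics defined in this paragraph, diameter and tightness, can be computed directly from the pairwise scores. The following is a minimal sketch under the same assumption as before, namely a symmetric matrix `dist` in which 0 means identical call graphs; the function name and the convention of returning zeros for a cluster with fewer than two samples are illustrative choices.

```python
from itertools import combinations


def diameter_and_tightness(dist, members):
    """Return (diameter, tightness) of one cluster.

    dist    -- symmetric matrix of pairwise dissimilarities
    members -- indices of the samples that belong to the cluster
    """
    values = [dist[i][j] for i, j in combinations(members, 2)]
    if not values:
        # A cluster with fewer than two samples has no pairs (assumption).
        return 0.0, 0.0
    # Diameter: score of the most dissimilar pair.
    # Tightness: average score over all pairs.
    return max(values), sum(values) / len(values)
```

A cluster of isomorphic call graphs, such as cluster 16, yields a diameter of 0, whereas a large gap between the two values points to a few outlying samples inside an otherwise coherent cluster.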
Figure 6: DBSCAN clustering with MinPts = 3,
Rad = 0.3. The colors depict the frequency of
occurrence of a malware sample from a certain family
in a cluster.



[Plot omitted; horizontal axis: Cluster ID]

Figure 7: Plot of the diameter and tightness of the
DBSCAN clustering.



5 Conclusion

In this paper, automated classification of malware
into malware families has been studied. First,
metrics to express the similarities among malware
samples which are represented as call graphs have been
introduced, after which the similarity scores are
used to cluster structurally similar malware
samples. Malware samples which are found to be very
similar to known malicious code are likely
mutations of the same initial source. Automated
recognition of similarities as well as differences among
these samples will ultimately aid and accelerate the
process of malware analysis, rendering it no longer
necessary to write detection patterns for each
individual sample within a family. Instead, anti-virus
engines can employ generic signatures targeting the
mutual similarities among samples in a malware
family.

After an introduction of call graphs in Section 2
and a brief description of the extraction of call
graphs from malware samples, Section 3 discusses
methods to compare call graphs mutually. Graph
similarity is expressed via the graph edit distance,
which, based on our experiments, seems to be a
viable metric. To facilitate the discovery of malware
families, Section 4 applies several clustering
algorithms on a set of malware call graphs. Verification
of the classifications is performed against a set of
194 unique malware samples, manually categorized
into 24 malware families by the anti-virus company
F-Secure Corporation. The clustering algorithms
used in the experiments include various versions of
the k-medoids clustering algorithm, as well as the
DBSCAN algorithm. One of the main issues
encountered with k-medoids clustering is the
specification of the desired number of clusters. Metrics to
determine the optimal number of clusters did not
yield conclusive results, and hence it followed that
k-means clustering is not effective for discovering
malware families.

Much better results, on the other hand, are obtained
with the density-based clustering algorithm
DBSCAN; using DBSCAN we were able to
successfully identify malware families. At the date of
writing, automated classification is also attempted on
larger data sets consisting of a few thousand
samples. However, manual analysis of the results is a
time-consuming process, and hence the results could
not be included in time in this paper. Future goals
are to link our malware identification and family
recognition software to a live stream of incoming
malware samples. Observing the emergence of new
malware families, as well as automated
implementation of protection schemes against malware families,
belongs to the long-term prospects of malware
detection through call graphs.